The aim of this project is to investigate how the performance of elite runners is linked to factors like height, weight and age. This will be done by analysing historical data on athlete performances, mainly in modern Olympic running events. Understanding the relationship between these characteristics and performance could help in developing training plans and improving athlete performance in the future.
Athlete performance is clearly affected by many factors, and this analysis will be limited to just a few of them, dictated mainly by the data available. The specific questions addressed here are:
There will be no particular modelling or machine learning in this analysis because the questions can be answered by visualising the statistics alone.
# Import libraries
import pandas as pd
import chardet # For character encoding
import ftfy # For fixing encoding issues
from matplotlib import pyplot as plt
from matplotlib import pylab as plb
from datetime import datetime, time
import numpy as np
from fuzzywuzzy import fuzz # For inexact ("fuzzy") string matching
from fuzzywuzzy import process
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() # For future compatibility when plotting with datetime
This analysis will attempt some originality by combining three separate data sets. This allows athlete characteristics to be linked to athlete performances so any relationship between the two can be investigated.
First, load the data sets and briefly examine them.
The first data set is the Olympic Games results and athlete data, 1896-2016.
Source:
# Results data is in the first file:
all_olympics = pd.read_csv('datasets/athlete_events.csv')
all_olympics.head()
all_olympics.info()
Summary
The full Olympic Games data set contains useful information about the competitors over a 120-year period (height, weight, age, country of origin, and medal, if they won one).
The second data set contains the Olympic track and field times and results. Note it only includes data for medal winners. Source:
# Data set 2 - Olympic track and field times and results. Source:
# https://www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1
# There is an additional column in a few of the rows. This is unlabelled so not useful in this analysis.
# Therefore, read explicitly labelled columns and discard the unlabelled column.
ol_tf = pd.read_csv('datasets/results.csv', names=['Gender',
'Event',
'Location',
'Year',
'Medal',
'Name',
'Nationality',
'Result'])
ol_tf.drop(index=0, inplace=True)
ol_tf.head()
ol_tf.info()
Summary
The most useful feature of the track and field results is the detailed running times and event results. This will be linked to the full Olympic data (including its information on the athletes' characteristics) later in the analysis.
The third data set contains the top 1000 running performances for each running event.
Source:
https://www.kaggle.com/jguerreiro/running/downloads/running.zip/2
top_running = pd.read_csv('datasets/data.csv')
top_running.head()
top_running.info()
Summary
This data set is good because it contains a large number of data points (1000) including finish times for every running discipline. It is not limited to Olympic performances, but all the events are Olympic distances, with the exception of the half marathon.
print("Number of unique events is {}".format(len(all_olympics['Event'].unique())))
765 events is far too many to analyse. It also includes some events which have not taken place in the Olympics for a long time. This analysis is focussed on modern running events, so we will extract a subset of the results.
olympic_sports_groups = all_olympics.groupby('Sport')
athletics = olympic_sports_groups.get_group('Athletics')
all_athletics_events = athletics['Event'].unique()
all_athletics_events
This is a more manageable list of events. There are still some events here that don't exist in the modern Games. The next step is to remove any events that didn't take place in the most recent summer Games (2016).
modern_athletics_events = athletics[athletics['Year']==2016]['Event'].unique()
modern_athletics_events
removed_events = set(all_athletics_events).difference(modern_athletics_events)
removed_events
indices_to_remove = [athletics.index[i] for i in range(len(athletics)) if athletics['Event'].iloc[i] in removed_events]
modern_athletics = athletics.drop(index=indices_to_remove)
modern_athletics['Event'].unique()
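The same filtering step can also be written with a boolean mask instead of a list of indices to drop. The sketch below is a minimal illustration on a made-up toy data frame (the rows and event names are assumptions, not taken from the data set); it keeps every row, from any year, whose event also appears in 2016.

```python
import pandas as pd

# Toy stand-in for the athletics data frame (values are illustrative only).
toy = pd.DataFrame({
    'Event': ['100 metres', 'Tug-Of-War', '100 metres', 'Marathon', 'Tug-Of-War'],
    'Year': [1900, 1900, 2016, 2016, 1904],
})

# Events that took place in the most recent Games in the toy data.
modern_events = toy.loc[toy['Year'] == 2016, 'Event'].unique()

# Keep every row (from any year) whose event is still contested.
modern_rows = toy[toy['Event'].isin(modern_events)]
print(modern_rows)
```

Series.isin() builds the mask in one vectorised pass, which avoids looping over the rows in Python.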
This analysis will focus on individual running events. So, now remove the field events and non-running events.
# These are the events to keep for the analysis.
modern_individual_running_events = {"Athletics Women's 100 metres",
"Athletics Men's 1,500 metres",
"Athletics Men's 5,000 metres",
"Athletics Men's 110 metres Hurdles",
"Athletics Women's Marathon",
"Athletics Men's 100 metres",
"Athletics Men's 400 metres Hurdles",
"Athletics Men's 400 metres",
"Athletics Men's 800 metres",
"Athletics Men's Marathon",
"Athletics Men's 10,000 metres",
"Athletics Men's 200 metres",
"Athletics Men's 3,000 metres Steeplechase",
"Athletics Women's 200 metres",
"Athletics Women's 5,000 metres",
"Athletics Women's 10,000 metres",
"Athletics Women's 1,500 metres",
"Athletics Women's 800 metres",
"Athletics Women's 400 metres",
"Athletics Women's 400 metres Hurdles",
"Athletics Women's 100 metres Hurdles",
"Athletics Women's 3,000 metres Steeplechase"}
removed_events = set(modern_athletics_events).difference(modern_individual_running_events)
removed_events
indices_to_remove = [modern_athletics.index[i] for i in range(len(modern_athletics)) if modern_athletics['Event'].iloc[i] in removed_events]
ol_running = modern_athletics.drop(index=indices_to_remove)
ol_running.head()
# Check for missing values in each column.
ol_running.isnull().sum()
Many rows have no entry for a medal, and this is expected: many competitors do not win a medal, so no special treatment is needed for missing values in the Medal feature. There are also a lot of missing values for height, weight and age. These will be examined now.
age_missing = ol_running[ol_running['Age'].isnull()]
weight_missing = ol_running[ol_running['Weight'].isnull()]
height_missing = ol_running[ol_running['Height'].isnull()]
age_missing.head()
weight_missing.head()
height_missing.head()
We now have three groups of rows that have at least one missing value. Now find out if they overlap by using sets:
age_missing_indices = set(age_missing.index)
weight_missing_indices = set(weight_missing.index)
height_missing_indices = set(height_missing.index)
print("The number of rows where both height and weight are missing is {}".format(
len(weight_missing_indices.intersection(height_missing_indices))))
print("The number of rows where both age and weight are missing is {}".format(
len(age_missing_indices.intersection(weight_missing_indices))))
print("The number of rows where both age and height are missing is {}".format(
len(age_missing_indices.intersection(height_missing_indices))))
print("The number of rows where age, height and weight are missing is {}".format(
len(age_missing_indices.intersection(height_missing_indices, weight_missing_indices))))
Of the rows where either height (2987) or weight (3131) are missing, most (2961) of them are missing both height and weight. Of the rows where age is missing (667), most (at least 504) are also missing either weight, height or both. The overlap between the missing data sets is large, which is good news, because it means more of the rows are fully populated, so more of this data is usable without dropping data or imputation. For now, all the data will be kept (not dropping rows with missing data).
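The overlap counts can also be obtained directly from boolean masks, without building index sets. This is a minimal sketch on a made-up four-row data frame (the values are assumptions for illustration only), so the counts differ from the real figures quoted above.

```python
import pandas as pd

# Toy data: None marks a missing value (illustrative values only).
toy = pd.DataFrame({
    'Age':    [23, None, 30, None],
    'Height': [180, None, None, 175],
    'Weight': [70, None, None, 68],
})

# True wherever a value is missing.
missing = toy[['Age', 'Height', 'Weight']].isnull()

# Combine the masks to count overlapping gaps.
both_height_weight = int((missing['Height'] & missing['Weight']).sum())
both_age_weight = int((missing['Age'] & missing['Weight']).sum())
all_three = int(missing.all(axis=1).sum())
fully_populated = int((~missing.any(axis=1)).sum())

print(both_height_weight, both_age_weight, all_three, fully_populated)
```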
The Event feature is a categorical variable. This will be encoded as follows:
This method of encoding is chosen because it groups together similar types of events (e.g., hurdles events are treated as a group, flat track events are treated as a separate group) and also separates them by the distance of each event (100m, 200m, etc.)
# Simple string processing in Event column
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Women's ", "")
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Men's ", "")
ol_running['Event'] = ol_running['Event'].str.replace(" metres", "")
ol_running['Event'] = ol_running['Event'].str.replace(",", "")
ol_running.head()
Adding the new columns, copying the values between columns and removing duplicates is repetitive, so write a function for this:
def encode_events(df, col, to_replace, replacement):
    """
    Helper function to insert new columns, copy and convert values to the correct column
    """
    # Insert new column
    df.insert(df.columns.get_loc('Event'), col, 0)
    # Copy values across to new column
    df.loc[df['Event'].str.contains(to_replace), col] = df['Event'].str.replace(to_replace, replacement)
    # Remove values from original column
    df.loc[df[col] != 0, 'Event'] = '0'

def string_to_int(df, features):
    """
    Helper function to cast string values to integers.
    """
    for feature in features:
        df[feature] = pd.to_numeric(df[feature], downcast='integer')
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = [' Hurdles', 'Marathon', ' Steeplechase']
replacement = ['', '42195', '']
for i in range(len(new_columns)):
    encode_events(ol_running, new_columns[i], to_replace[i], replacement[i])
ol_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_running, columns_to_int)
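To make the result of this encoding concrete, here is a minimal per-event sketch of the same idea in plain Python. The function encode_event is hypothetical (it is not part of the notebook) and assumes the event string has already been stripped down as above, for example '400 Hurdles' or '10000'. Each event ends up with its distance in exactly one of the four columns and 0 in the others.

```python
def encode_event(event):
    """Return the distance of an event in the column matching its type (hypothetical helper)."""
    encoded = {'Track_Flat': 0, 'Hurdles': 0, 'Road': 0, 'Steeplechase': 0}
    if event.endswith(' Hurdles'):
        encoded['Hurdles'] = int(event.replace(' Hurdles', ''))
    elif event == 'Marathon':
        # The marathon is recorded as its distance in metres.
        encoded['Road'] = 42195
    elif event.endswith(' Steeplechase'):
        encoded['Steeplechase'] = int(event.replace(' Steeplechase', ''))
    else:
        # Anything left is a flat track event named by its distance.
        encoded['Track_Flat'] = int(event)
    return encoded

for example in ['400 Hurdles', 'Marathon', '3000 Steeplechase', '10000']:
    print(example, encode_event(example))
```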
The other two data sets refer to this feature as 'Gender'. For ease of comparison, rename it from 'Sex' to 'Gender'.
ol_running.rename(columns={'Sex': 'Gender'}, inplace=True)
For ease of comparison with the other data sets, convert 'Gold' to 'G', 'Silver' to 'S', and 'Bronze' to 'B'.
medals = ['Gold', 'Silver', 'Bronze']
short_medals = ['G', 'S', 'B']
for i in range(len(medals)):
    # Each matching cell holds exactly the long medal name, so assign the short form directly.
    ol_running.loc[ol_running['Medal'] == medals[i], 'Medal'] = short_medals[i]
The 'Name' column also looks difficult to use:
ol_running['Name'].head()
There are alternative names/nicknames in parentheses and double quotes. The intention is to use the names later on, so to make this easier, remove sections in parentheses and double quotes, and convert the name string to lowercase. Make this a function so it can be used on the other data sets later on.
def process_names(df):
    """
    Helper function to perform some cleaning on the athlete Name field.
    """
    df.rename(columns={'Name': 'RawName'}, inplace=True)
    df.insert(loc=df.columns.get_loc('RawName'), column='Name', value=np.nan)
    # Remove nicknames in double quotes (regex=True is needed in recent pandas versions)
    df['Name'] = df['RawName'].str.replace(r'"(.*?)"', '', regex=True)
    # Remove alternative names in parentheses
    df['Name'] = df['Name'].str.replace(r'\((.*?)\)', '', regex=True)
    df['Name'] = df['Name'].str.lower()
process_names(ol_running)
ol_running.head()
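The effect of this cleaning on a single name can be checked in isolation with the standard re module. This is a minimal sketch: clean_name is a hypothetical helper, the input strings are illustrative rather than rows copied from the data, and the whitespace collapsing at the end is an extra step that process_names does not perform.

```python
import re

def clean_name(raw_name):
    """Strip quoted nicknames and parenthesised alternatives, then lowercase (hypothetical helper)."""
    no_quotes = re.sub(r'"(.*?)"', '', raw_name)
    no_parens = re.sub(r'\((.*?)\)', '', no_quotes)
    # Extra step (assumption): collapse the double spaces left behind by the removals.
    return ' '.join(no_parens.split()).lower()

print(clean_name('Mohamed Muktar Jama "Mo" Farah'))
print(clean_name('Jane Doe (-Smith)'))
```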
Wrangling of this data set is complete, and from here on the cleaned data frame will always be called ol_running.
print("Number of unique events is {}".format(len(ol_tf['Event'].unique())))
ol_tf.head()
all_ol_tf_events = ol_tf['Event'].unique()
all_ol_tf_events
As in the previous section, this analysis will keep the individual running events and drop the remainder.
ol_tf_running_events = {'10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
'200M Men', '3000M Steeplechase Men',
'400M Hurdles Men', '400M Men', '5000M Men',
'800M Men', 'Marathon Men', '10000M Women', '100M Hurdles Women',
'100M Women', '1500M Women', '200M Women',
'3000M Steeplechase Women', '400M Hurdles Women', '400M Women',
'5000M Women', '800M Women', 'Marathon Women'}
indices_to_remove = [ol_tf.index[i] for i in range(len(ol_tf))
if not ol_tf['Event'].iloc[i] in ol_tf_running_events]
ol_tf_running = ol_tf.drop(index=indices_to_remove)
ol_tf_running.head()
ol_tf_running['Event'].unique()
This now contains the data of interest.
# Check for missing values in each column.
ol_tf_running.isnull().sum()
No missing values are shown, but this is deceptive, since some of the 'Result' fields contain the string 'None'.
ol_tf_running[ol_tf_running['Result'] == 'None'].head()
ol_tf_running.loc[ol_tf_running['Result'] == 'None', 'Result'] = pd.NaT
ol_tf_running.dropna(subset=['Result'], inplace=True)
The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.
# Simple string processing in Event column
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Women", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Men", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("M ", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace(",", "")
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = ['Hurdles', 'Marathon', 'Steeplechase']
replacement = ['', '42195', '']
for i in range(len(new_columns)):
    encode_events(ol_tf_running, new_columns[i], to_replace[i], replacement[i])
ol_tf_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
The aim is to convert the 'Result' string to a datetime object, extract the time from it and store it in a feature called 'Time'. The time formats vary a lot in this data set, so some cleaning is needed.
It's possible to create general groups of events that share similar formats.
# Hurdle events
ol_tf_running_hurdles_groups = ol_tf_running.groupby('Hurdles')
# Road running events
ol_tf_running_road_groups = ol_tf_running.groupby('Road')
# Steeplechase
ol_tf_running_steeplechase_groups = ol_tf_running.groupby('Steeplechase')
# Track (flat) events
ol_tf_running_trackf_groups = ol_tf_running.groupby('Track_Flat')
event_groups = [ol_tf_running_hurdles_groups,
ol_tf_running_road_groups,
ol_tf_running_steeplechase_groups,
ol_tf_running_trackf_groups]
for group in event_groups:
    for event in list(group.groups.keys())[1:]: # Ignore the first event in each category where distance=0
        print("Event: {}".format(event))
        print(group.get_group(event)['Result'].head(3))
This shows it's possible to define three time formats in this result set:
# Time format for the sprint events
time_format_sprints = '%S.%f'
# Time format for middle distance events
time_format_middle = '%M:%S.%f'
# Time format for long distance events
time_format_long = '%H:%M:%S'
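As a quick check of what each format accepts, the three format strings can be tried with datetime.strptime from the standard library. The result strings below are made-up examples, not values read from the data set. Note that '%S' only accepts seconds values below 60, so the sprint format only suits results under one minute.

```python
from datetime import datetime, time

time_format_sprints = '%S.%f'
time_format_middle = '%M:%S.%f'
time_format_long = '%H:%M:%S'

# Illustrative result strings (assumptions, not taken from the data).
sprint = datetime.strptime('9.81', time_format_sprints).time()
middle = datetime.strptime('3:50.00', time_format_middle).time()
marathon = datetime.strptime('2:08:44', time_format_long).time()

print(sprint, middle, marathon)
```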
Examining each event in more detail shows that some further processing is needed.
Steeplechase
ol_tf_running_steeplechase_groups.get_group('3000 ')['Result'].head()
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Steeplechase'] == '3000 ', 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Steeplechase'] == '3000 ']['Result'], format=time_format_middle).apply(datetime.time)
Hurdles
ol_tf_running_hurdles_groups.get_group('100 ')['Result'].head()
ol_tf_running_hurdles_groups.get_group('110 ')['Result'].head()
ol_tf_running_hurdles_groups.get_group('400 ')['Result'].head()
In addition, some of the time strings have a leading '0:':
ol_tf_running[ol_tf_running['Result'] == '0:54.0']
# Remove leading '0:':
ol_tf_running.loc[ol_tf_running['Hurdles'] == '400 ', 'Result'] = ol_tf_running[ol_tf_running['Hurdles'] == '400 ']['Result'].str.replace('0:', '')
# For all the Hurdles distances - convert to datetime and extract the time part only.
events = list(ol_tf_running_hurdles_groups.groups.keys())
events.remove(0) # Ignore the first event in each category where distance=0
for event in events:
    ol_tf_running.loc[ol_tf_running['Hurdles'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Hurdles'] == event]['Result'], format=time_format_sprints).apply(datetime.time)
Track (Flat)
ol_tf_running_trackf_groups.get_group('100')['Result'].head()
ol_tf_running_trackf_groups.get_group('200')['Result'].head()
ol_tf_running_trackf_groups.get_group('400')['Result'].head()
ol_tf_running_trackf_groups.get_group('800')['Result'].head()
ol_tf_running_trackf_groups.get_group('1500')['Result'].head()
ol_tf_running_trackf_groups.get_group('5000')['Result'].head()
ol_tf_running_trackf_groups.get_group('10000')['Result'].head()
Track events for distances less than 800m all have times written in the format defined in time_format_sprints. 800m and above use the format defined in time_format_middle.
sprint_distances = ['100', '200', '400']
middle_distances = ['800', '1500', '5000', '10000']
As with the hurdles distances above, remove any leading '0:':
# Remove leading '0:':
for event in sprint_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Result'] = ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'].str.replace('0:', '')
# For the track sprint events - convert to datetime and extract the time part only.
for event in sprint_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'], format=time_format_sprints).apply(datetime.time)
# For the track middle distance events - convert to datetime and extract the time part only.
for event in middle_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'], format=time_format_middle).apply(datetime.time)
Road
ol_tf_running_road_groups.get_group('42195 ').head()
Some specific examples show there are several problems:
ol_tf_running_road_groups.get_group('42195 ').loc[[1379]]
ol_tf_running_road_groups.get_group('42195 ').loc[[1392]]
ol_tf_running_road_groups.get_group('42195 ').loc[[1417]]
This shows several formatting problems:
Taking these in turn:
# Replace 'h' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].str.replace('h', ':')
# Remove fractional seconds (everything from the first '.' onwards); this is a regular expression, so regex=True is required in recent pandas versions
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].str.replace(r'\..*', '', regex=True)
# Replace '-' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].str.replace('-', ':')
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].head()
There are also some values that only include hours and minutes:
ol_tf_running[ol_tf_running['Result'] == '2:32']
for i in ol_tf_running[ol_tf_running['Road'] == '42195 '].index:
    if len(ol_tf_running.loc[i, 'Result'].split(':')) < 3:
        # Assign through a single .loc call to avoid chained assignment
        ol_tf_running.loc[i, 'Result'] = ol_tf_running.loc[i, 'Result'] + ':00'
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'], format=time_format_long).apply(datetime.time)
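The marathon clean-up steps above can be collected into one function and tried on individual strings. This is a minimal sketch: normalise_marathon_result is a hypothetical helper and the input strings are invented to exercise each rule, they are not rows quoted from the data set.

```python
import re

def normalise_marathon_result(result):
    """Bring a marathon result string into H:MM:SS form (hypothetical helper)."""
    result = result.replace('h', ':')       # '2h12:19' -> '2:12:19'
    result = re.sub(r'\..*', '', result)    # drop fractional seconds
    result = result.replace('-', ':')       # '2-32-35' -> '2:32:35'
    if len(result.split(':')) < 3:
        result += ':00'                     # hours and minutes only: assume zero seconds
    return result

for raw in ['2h12:19.0', '2-32-35', '2:32', '2:08:44']:
    print(raw, '->', normalise_marathon_result(raw))
```

Keeping the rules in one place makes it easy to test them before applying them to the whole column, for example with Series.apply().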
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_tf_running, columns_to_int)
Some names include nicknames in double quotes. There is also a string encoding problem causing some characters to be displayed wrongly. For example, 'Emil ZÃTOPEK' and 'Katrin DÃRRE' below:
ol_tf_running['Name'].loc[25]
ol_tf_running['Name'].loc[2322]
# Check encoding of the file
with open("datasets/results.csv", 'rb') as file:
print(chardet.detect(file.read()))
chardet suggests the file is UTF-8 encoded, so the garbled characters are not explained by reading the file with the wrong encoding. Instead, try to clean this up by using the ftfy package to fix the bad encodings (Reference for ftfy: https://ftfy.readthedocs.io/en/latest/)
ol_tf_running['Name'] = ol_tf_running['Name'].apply(ftfy.fix_encoding)
ol_tf_running['Name'].loc[25]
ol_tf_running['Name'].loc[2322]
This shows the bad encodings have disappeared.
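For intuition about what went wrong, the garbling can be reproduced with the standard library alone. The sketch below assumes the text was UTF-8 that was at some point decoded as Latin-1, which is one common cause of this pattern; ftfy detects and handles many more cases than this single round trip.

```python
original = 'Emil ZÁTOPEK'

# Encode correctly as UTF-8, then decode with the wrong codec (assumed here to be Latin-1).
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the wrong step recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))
print(repaired)
```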
Other name text processing is the same as in the previous section.
# Use the processing function defined previously
process_names(ol_tf_running)
ol_tf_running['Name'].head()
Female athletes are categorised as 'W' in the 'Gender' column. Change this to be 'F' for consistency with the other data sets.
ol_tf_running['Gender'] = ol_tf_running['Gender'].str.replace('W', 'F')
This concludes cleaning of the second data set, which will be named ol_tf_running from here on.
Select individual running events as with the previous two data sets.
print("Number of unique events is {}".format(len(top_running['Event'].unique())))
top_running['Event'].unique()
These are all valid events for this analysis. No need to remove any.
top_running.isnull().sum()
There are a few missing 'Place' values. This analysis will not use this feature, so no further action is needed on this for now.
top_running.head()
The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.
# Simple string processing in Event column
# Replace the race type strings ('Marathon', 'Half marathon') with their distance in metres:
racetype = ['Marathon', 'Half marathon']
distance = ['42195 Road', '21098 Road']
for i in range(len(racetype)):
    top_running.loc[top_running['Event'] == racetype[i], 'Event'] = top_running[
        top_running['Event'] == racetype[i]]['Event'].str.replace(racetype[i], distance[i])
top_running['Event'] = top_running['Event'].str.replace(",", "")
new_columns = ['Road']
to_replace = [' Road']
replacement = ['']
for i in range(len(new_columns)):
    encode_events(top_running, new_columns[i], to_replace[i], replacement[i])
top_running['Event'] = top_running['Event'].str.replace(" m", "")
top_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
top_running.head()
# In the 'Date' column, the year will be used as one of the keys to merge the data sets.
# Therefore, create a separate 'Year' column and populate it.
top_running.insert(top_running.columns.get_loc('Date'), 'Year', 0)
top_running['Year'] = top_running['Date'].str.split("-", expand=True)[0]
#top_running.rename(columns={'Date': 'Year'}, inplace=True)
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Road', 'Year']
string_to_int(top_running, columns_to_int)
Convert the time strings into datetime objects. First, look at the time formats used in each event:
# Road running events
top_running_road_groups = top_running.groupby('Road')
# Track (flat) events
top_running_trackf_groups = top_running.groupby('Track_Flat')
event_groups = [top_running_road_groups,
top_running_trackf_groups]
for group in event_groups:
    for event in list(group.groups.keys())[1:]: # Ignore the first event in each category where distance=0
        print("Event: {}".format(event))
        print(group.get_group(event)['Time'].head(3))
For the road running events, the time format is the same as already defined in time_format_long above. For the track events, most of the times have the same format ('%H:%M:%S.%f'), but there are occasional cases where the fractional seconds field is missing. For these cases it's possible to use the infer_datetime_format feature of pandas.to_datetime().
# Convert strings in the Time column to datetime objects for the road running events.
# Convert to datetime and extract the time part only.
events = top_running['Road'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[top_running['Road'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Road'] == event]['Time'], format=time_format_long).apply(datetime.time)
# Convert strings in the Time column to datetime objects for the track running events.
# Convert to datetime and extract the time part only.
events = top_running['Track_Flat'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[top_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Track_Flat'] == event]['Time'], infer_datetime_format=True).apply(datetime.time)
Looking at the names in this data set - they seem straightforward:
top_running['Name'].head()
Other name text processing is the same as in the previous section.
# Use the processing function defined previously
process_names(top_running)
To be consistent with the other data sets, change the possible values of the 'Gender' feature to be either 'M' or 'F' instead of 'Men' or 'Women'.
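# Vectorised mapping: 'Men' becomes 'M', everything else becomes 'F'.
# (The previous list comprehension mixed positional .iloc with index labels.)
top_running['Gender'] = np.where(top_running['Gender'] == 'Men', 'M', 'F')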
top_running.head()
Later in this analysis, times from this data set will be merged into the Olympic data set. To facilitate this, it is necessary to label which rows correspond to an Olympic Games. This will be done by comparing the 'Date' field of the result to the known dates of the Olympic Games.
# Convert dates to datetime format
top_running['Date'] = pd.to_datetime(top_running['Date'], infer_datetime_format=True)
# What's the earliest year in the top_running data set?
min(top_running['Year'].tolist())
So there is no need to look at years before 1962.
# List of dates of Olympic summer games
# Source: https://en.wikipedia.org/wiki/Summer_Olympic_Games
# Use format Year-month-day
olympic_dates = [
['1964-10-10', '1964-10-24'],
['1968-10-12', '1968-10-27'],
['1972-08-26', '1972-09-10'],
['1976-07-17', '1976-08-01'],
['1980-07-19', '1980-08-03'],
['1984-07-28', '1984-08-12'],
['1988-09-17', '1988-10-02'],
['1992-07-25', '1992-08-09'],
['1996-07-19', '1996-08-04'],
['2000-09-15', '2000-10-01'],
['2004-08-13', '2004-08-29'],
['2008-08-08', '2008-08-24'],
['2012-07-27', '2012-08-12'],
['2016-08-05', '2016-08-21']
]
olympic_dates_df = pd.DataFrame(olympic_dates, columns=['Start', 'End'],
index=[1964,
1968,
1972,
1976,
1980,
1984,
1988,
1992,
1996,
2000,
2004,
2008,
2012,
2016])
olympic_dates_df['Start'] = pd.to_datetime(olympic_dates_df['Start'], format='%Y-%m-%d')
olympic_dates_df['End'] = pd.to_datetime(olympic_dates_df['End'], format='%Y-%m-%d')
top_running.insert(loc=top_running.columns.get_loc('Date'), column='Olympics', value=False)
for y in olympic_dates_df.index:
    top_running.loc[top_running['Year'] == y, 'Olympics'] = (
        (top_running['Year'] == y)
        & (top_running['Date'] >= olympic_dates_df.loc[y, 'Start'])
        & (top_running['Date'] <= olympic_dates_df.loc[y, 'End']))
top_running[top_running['Olympics']== True].head()
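The labelling rule itself is simple: a performance counts as Olympic if its date falls within the start and end dates of that year's Games, inclusive. The sketch below shows the rule in plain Python for two of the Games listed above. The function during_olympics and the test dates are hypothetical and only illustrate the comparison.

```python
from datetime import date

# Start and end dates for two of the Games in the table above.
olympic_windows = {
    2012: (date(2012, 7, 27), date(2012, 8, 12)),
    2016: (date(2016, 8, 5), date(2016, 8, 21)),
}

def during_olympics(performance_date):
    """True if the date lies within the Games held in that year (hypothetical helper)."""
    window = olympic_windows.get(performance_date.year)
    if window is None:
        return False  # no Games that year
    start, end = window
    return start <= performance_date <= end

print(during_olympics(date(2012, 8, 4)))   # inside the 2012 window
print(during_olympics(date(2012, 9, 1)))   # same year, after the Games
```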
It is useful to label the top 10 performances in each event, for each gender, and for every year. This is because the data set includes the top 1000 performances for all events, and since the main concern for this analysis is the factors affecting the improvement of performances, it is worth identifying the top 10 performances in each year.
top_running.insert(loc=top_running.columns.get_loc('Time'), column='Top 10', value=False)
event_categories = ['Track_Flat', 'Road']
for gender in top_running['Gender'].unique().tolist():
    for category in event_categories:
        events = top_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        for event in events:
            for year in top_running[top_running[category] == event]['Year'].unique().tolist():
                # Rows for this event, year and gender
                in_group = ((top_running[category] == event) &
                            (top_running['Year'] == year) &
                            (top_running['Gender'] == gender))
                best_times_per_year = top_running[in_group]['Time'].tolist()
                if len(best_times_per_year) > 0:
                    # Cutoff is the 10th-best time, or the slowest time if there are fewer than 10
                    cutoff = sorted(best_times_per_year)[min(len(best_times_per_year), 10) - 1]
                    # Use <= so the performance at the cutoff is itself labelled;
                    # a strict < would label at most 9 performances.
                    top_running.loc[in_group, 'Top 10'] = top_running.loc[in_group, 'Time'] <= cutoff
Perform a check that this has worked:
top_running[(top_running['Road'] == 42195) & (top_running['Top 10'] == True) & (top_running['Year'] == 2012)].head()
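A more compact way to express "the best N per event, year and gender" is to rank the times within each group. The sketch below is an alternative formulation, not the notebook's method. It uses a made-up toy data frame with times in seconds and labels the top 3 rather than the top 10 to keep the example small.

```python
import pandas as pd

# Toy data: times in seconds (illustrative values only).
toy = pd.DataFrame({
    'Event':   [100, 100, 100, 100, 200, 200],
    'Year':    [2012] * 6,
    'Gender':  ['M'] * 6,
    'Seconds': [9.90, 9.63, 9.75, 9.80, 19.9, 19.3],
})

# Rank times within each (event, year, gender) group; the fastest time gets rank 1.
# method='min' gives tied times the same rank.
rank = toy.groupby(['Event', 'Year', 'Gender'])['Seconds'].rank(method='min')
toy['Top 3'] = rank <= 3

print(toy)
```

Groups with fewer than three rows are labelled entirely True, which mirrors the fallback used above when a year has fewer than 10 performances.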
This concludes the processing of this data set, and the data frame will be named top_running from this point on.
The full Olympic data set has information about athlete characteristics but no times or results. Both the track and field data set and the top running times data set have times and results, but no athlete data. So, to answer questions about how results and athlete characteristics are related, it is necessary to merge these data sets. Athletes often compete in multiple Olympic Games and in different events, so it will be necessary to find a match based on the year, the event, the medal awarded and the athlete's name. It will be straightforward to match the year across both data sets, and also the events and medals, because the labels are already standardised. The name presents an additional challenge because, for some athletes appearing in both data sets, it is written differently in each. For example, here is how Mo Farah's performance in the 10000 m in 2016 looks in the Olympic track and field data set:
ol_tf.loc[[1]]
Compare the way his name is written to the way it appears for the same performance in the full Olympic data set:
ol_running.loc[[66487]]
Row 66487 contains Mo Farah's performance matching the one in the Olympic track and field data, but the name is written very differently. To overcome this, a method called fuzzy matching will be used.
The aim is to merge the time data into the ol_running data frame, where it is available.
ol_running.head()
Add three columns to ol_running: one for the time, one for the merged-in name and one for the match ratio. The last two can be used as a sanity check for the data merging process.
ol_running.insert(loc=len(ol_running.columns), column='Time', value=pd.NaT)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'), column='Merged_name', value=np.nan)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'), column='Ratio', value=np.nan)
Now define a function to merge the times from the ol_tf_running data set into the ol_running data set. This function splits the results in each data set into groups by event, year, gender and medal awarded. This cuts the full results set into much smaller and more manageable groups. Every pass of the loop examines a pair of corresponding groups, one from each data set, both covering the same combination of event, year, gender and medal awarded. The function then compares the names in each. If the strings don't match in a simple way (using str.find()), it applies the fuzzy matching algorithm to find the best match (process.extractOne()). If the match ratio between the two strings being compared is above a threshold (chosen arbitrarily as 50), the two rows are treated as a match, and the time, name and ratio are saved in the ol_running data set.
def merge_times(df, event_categories, debug=False):
"""
Helper function to merge times from one dataset into the ol_running dataset.
Event, year, gender, medal and athlete name are used as inputs to match athlete data from
one data frame to the same athlete's performance in the other data frame.
Names are matched using fuzzy string matching.
Input parameters:
df - data frame to merge
event_categories - list of categories of events
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Year', 'Gender', 'Medal'])
df_groups = df.groupby([category, 'Year', 'Gender', 'Medal'])
events = ol_running[category].unique().tolist()
events.remove(0)
for event in events:
if debug:
print(event)
for gender in ol_running['Gender'].unique().tolist():
if debug:
print(gender)
for year in ol_running['Year'].unique().tolist():
if debug:
print(year)
for medal in ol_running['Medal'].unique().tolist():
if debug:
print(medal)
try:
group_1 = ol_running_groups.get_group((event, year, gender, medal))
except KeyError:
if debug:
print("No results for this combination in ol_running_groups")
continue
try:
group_2 = df_groups.get_group((event, year, gender, medal))
except KeyError:
if debug:
print("No results for this combination in df_groups")
continue
name_options = group_1['Name'].tolist()
for name in group_2['Name']:
find_result = group_1['Name'].str.find(name)
i = find_result[find_result>-1].index
if debug:
print(i)
# Test for a non-empty index; i.any() would be False for a lone row label of 0
if len(i) > 0:
if debug:
print("str.find found a match: {}".format(name))
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Time'] = group_2.loc[group_2['Name']==name]['Time'].tolist()
else:
if debug:
print("str.find did NOT find a match:")
best_match = process.extractOne(name, name_options)
if debug:
print(best_match)
print("Best name: {}".format(best_match[0]))
print("Match confidence: {}".format(best_match[1]))
print("index={}".format(group_1[group_1['Name']==best_match[0]].index))
if best_match[1] > 50:
i=group_1[group_1['Name']==best_match[0]].index
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Merged_name'] = name
ol_running.loc[i, 'Ratio'] = best_match[1]
ol_running.loc[i, 'Time'] = group_2.loc[group_2['Name']==name]['Time'].tolist()
event_categories = ['Track_Flat', 'Hurdles', 'Road', 'Steeplechase']
merge_times(ol_tf_running, event_categories)
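The two-stage name matching inside merge_times can be illustrated in isolation. The sketch below is a minimal stand-in rather than the notebook's code: it swaps fuzzywuzzy's process.extractOne() for the standard library's difflib.SequenceMatcher so that it runs without extra packages, and the helper name match_name and the candidate list are hypothetical. The difflib ratio is scaled to 0-100 so it can be compared with a threshold, but its scores are not identical to fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def match_name(name, candidates, threshold=50):
    """Return (matched_candidate, score), or (None, score) if below threshold.

    Stage 1: plain substring test (the role str.find plays in merge_times).
    Stage 2: best similarity ratio, scaled to 0-100 (a stand-in for extractOne).
    """
    # Stage 1: accept the first candidate that contains the name verbatim.
    for candidate in candidates:
        if name in candidate:
            return candidate, 100

    # Stage 2: fall back to the most similar candidate.
    best, best_score = None, 0
    for candidate in candidates:
        ratio = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best, best_score = candidate, score

    # Only accept the fuzzy match if it clears the threshold.
    if best_score > threshold:
        return best, best_score
    return None, best_score


# Hypothetical candidate names, as they might appear within one group
candidates = ["Usain St. Leo Bolt", "Yohan Blake", "Justin Gatlin"]
print(match_name("Usain Bolt", candidates))        # fuzzy match
print(match_name("Yohan Blake", candidates))       # substring match
print(match_name("Zzz Qqq", candidates, threshold=80))  # no match
```

Grouping by event, year, gender and medal before matching keeps the candidate list short, which is what makes a low threshold such as 50 workable.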
Check how well the fuzzy matching algorithm is doing:
ol_running.loc[~ol_running['Time'].isnull()].head()
From a visual scan of the three columns corresponding to athlete name, it looks like the fuzzy matching algorithm is doing a good job of finding the correct names. The matching algorithm uses a threshold value of 50 for the match ratio. As a further check, examine the matches with the lowest match ratio:
ol_running[ol_running['Ratio'] < 70]
Only seven matches have a ratio below 70, and they all look correct, so the matching algorithm seems to be working well.
ol_running.loc[~ol_running['Time'].isnull()].info()
This method has merged in 1177 times. Next, merge in times from the top_running data frame. This is a little more complicated because it is necessary to group on events marked as True in the 'Olympics' feature of this data frame, to screen out other performances by the same athlete in the same year. In addition, some athletes may run several heats and a final in a single Games, so the time is taken from the race with the latest date within the period of the Games in question. If more than one race shares that latest date (e.g., if a heat and a final were run on the same day), one is chosen arbitrarily. This is not a huge problem, since the aim is to relate performances to height, weight and age, and those factors will not change within one day.
The ol_tf_running data set contained only medal-winning performances, so it was (almost) guaranteed that there would be a corresponding row in the ol_running data set. The top_running data set differs from the ol_tf_running data set in that it contains many non-medal winning performances. Therefore, it's not possible to use the 'Medal' field to group the performances and use that to help match them. This means there is a larger scope for false positives, where the fuzzy matching algorithm wrongly identifies two similar names as a match. To help solve this, the match ratio threshold is raised from 50 to 80 in this function. It's not straightforward to combine this extra complexity into the existing merge_times function, so write a new function to handle this.
def merge_times_ext(df, event_categories, debug=False):
"""
Helper function to merge times from one dataset into the ol_running dataset.
Event, year, gender, and athlete name are used as inputs to match athlete data from
one data frame to the same athlete's performance in the other data frame.
Names are matched using fuzzy string matching.
Input parameters:
df - data frame to merge
event_categories - list of categories of events
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Year', 'Gender'])
df_groups = df.groupby([category, 'Year', 'Gender', 'Olympics'])
events = ol_running[category].unique().tolist()
events.remove(0)
for event in events:
if debug:
print(event)
for gender in ol_running['Gender'].unique().tolist():
if debug:
print(gender)
for year in ol_running['Year'].unique().tolist():
if debug:
print(year)
try:
group_1 = ol_running_groups.get_group((event, year, gender))
except KeyError:
if debug:
print("No results for this combination in ol_running_groups")
continue
try:
group_2 = df_groups.get_group((event, year, gender, True))
except KeyError:
if debug:
print("No results for this combination in df_groups")
continue
name_options = group_1['Name'].tolist()
for name in group_2['Name']:
find_result = group_1['Name'].str.find(name)
i = find_result[find_result>-1].index
if debug:
print(i)
# Test for a non-empty index; i.any() would be False for a lone row label of 0
if len(i) > 0:
if debug:
print("str.find found a match: {}".format(name))
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
latest_race_date = group_2.loc[(group_2['Name']==name, 'Date')].max()
ol_running.loc[i, 'Time'] = group_2.loc[
(group_2['Name'] == name) &
(group_2['Date'] == latest_race_date)]['Time'].tolist()[0]
else:
if debug:
print("str.find did NOT find a match:")
best_match = process.extractOne(name, name_options)
if debug:
print(best_match)
print("Best name: {}".format(best_match[0]))
print("Match confidence: {}".format(best_match[1]))
print("index={}".format(group_1[group_1['Name']==best_match[0]].index))
if best_match[1] > 80:
i=group_1[group_1['Name']==best_match[0]].index
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Merged_name'] = name
ol_running.loc[i, 'Ratio'] = best_match[1]
latest_race_date = group_2.loc[(group_2['Name']==name, 'Date')].max()
ol_running.loc[i, 'Time'] = group_2.loc[
(group_2['Name'] == name) &
(group_2['Date'] == latest_race_date)]['Time'].tolist()[0]
event_categories = ['Track_Flat', 'Road']
merge_times_ext(top_running, event_categories)
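The selection rule used for top_running (keep only Olympic performances, then take the time from each athlete's latest race) can be sketched on a few hand-made rows. This is a plain-Python illustration under assumptions: the rows, names and the helper latest_olympic_time are hypothetical, and the notebook itself applies the rule with pandas.

```python
from datetime import date

# Hypothetical rows: (name, race date, time in seconds, run at the Olympics?)
races = [
    ("A. Runner", date(2012, 8, 4), 10.09, True),    # heat
    ("A. Runner", date(2012, 8, 5), 9.98, True),     # final
    ("A. Runner", date(2012, 6, 1), 9.95, False),    # non-Olympic meeting
    ("B. Sprinter", date(2012, 8, 5), 10.02, True),
]


def latest_olympic_time(rows, name):
    """Return the time from the athlete's latest Olympic race, or None."""
    olympic = [r for r in rows if r[0] == name and r[3]]
    if not olympic:
        return None
    # max() keeps the first row with the latest date, so a tie on the
    # same day is resolved arbitrarily by input order.
    latest = max(olympic, key=lambda r: r[1])
    return latest[2]


print(latest_olympic_time(races, "A. Runner"))    # time from the final
print(latest_olympic_time(races, "B. Sprinter"))
print(latest_olympic_time(races, "C. Nobody"))    # athlete not present
```

Note that the faster non-Olympic time is ignored: the filter on the Olympics flag is applied before the latest date is chosen.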
ol_running.loc[~ol_running['Time'].isnull()].head()
ol_running.loc[~ol_running['Time'].isnull()].info()
Merging in the top_running data set has increased the number of rows with a time by a few hundred.
To investigate this, plot athletes' results (i.e. times) against the date of the performance, individually for each event and separately for each gender. The plots use colour to identify Olympic medal-winning performances (see the key). Both the ol_tf_running and top_running data sets are used for this. The analysis will follow in section 5.1.
# Group by gender
top_running_gender_groups = top_running.groupby(['Gender', 'Top 10'])
top_running_m = top_running_gender_groups.get_group(('M', True))
top_running_f = top_running_gender_groups.get_group(('F', True))
ol_tf_running_gender_groups = ol_tf_running.groupby('Gender')
ol_tf_running_m = ol_tf_running_gender_groups.get_group('M')
ol_tf_running_f = ol_tf_running_gender_groups.get_group('F')
def build_graph_labels(gender, category, event, characteristic=None):
"""
build_graph_labels
Helper function to create strings to use in constructing the graph title
Input parameters:
gender - athlete gender group
category - type of event
event - specific distance
characteristic - athlete characteristic, default None
Returns:
gender_label - Readable gender string
event_label - Readable event name string
category_label - Readable event category string
unit - Unit for the characteristic to plot
"""
if gender=='M':
gender_label = "Male"
else:
gender_label = "Female"
if category == 'Road':
if event == 42195:
event_label = 'Marathon'
if event == 21098:
event_label = 'Half Marathon'
category_label = 'Road Running'
if category == 'Track_Flat':
event_label = str(event)+'m'
category_label = 'Track (Flat)'
if category == 'Hurdles':
event_label = str(event)+'m'+' Hurdles'
category_label = 'Hurdles'
if category == 'Steeplechase':
event_label = str(event)+'m'+' Steeplechase'
category_label = 'Steeplechase'
if characteristic == 'Height':
unit='cm'
elif characteristic == 'Weight':
unit='kg'
elif characteristic == 'Age':
unit='years'
elif characteristic == 'BMI':
unit='kg/m^2'
else:
unit=None
return gender_label, event_label, category_label, unit
def plot_times(event_categories, debug=False):
"""
Helper function to plot finish times for athletes across all events.
Input parameters:
event_categories - list of categories of events
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
global graph_number
top_running_data_present = True
ol_tf_running_data_present = True
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
if category == 'Road':
events.append(21098)
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_tf_running_gender_groups.groups.keys():
if gender == 'M':
top_running_group = top_running_m
ol_tf_running_group = ol_tf_running_m
else:
top_running_group = top_running_f
ol_tf_running_group = ol_tf_running_f
plt.figure(figsize=(18, 9))
# Plot top running time data, if it exists
try:
plt.scatter(top_running_group[top_running_group[category] == event]['Date'],
list(top_running_group[top_running_group[category] == event]['Time']),
color='b', label='Top running times')
except KeyError:
if debug:
print("No data from top running times for this event.")
top_running_data_present = False
# Plot each olympic medal colour, if data exists for this event
if ol_tf_running_group[(ol_tf_running_group[category] == event)].shape[0] != 0:
plt.scatter(pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'G')]['Year'], format='%Y'),
list(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'G')]['Time']),
color='gold', label='Olympic gold medal')
plt.scatter(pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'S')]['Year'], format='%Y'),
list(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'S')]['Time']),
color='silver', label='Olympic silver medal')
plt.scatter(pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'B')]['Year'], format='%Y'),
list(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'B')]['Time']),
color='brown', label='Olympic bronze medal')
else:
ol_tf_running_data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if (ol_tf_running_data_present == True) or (top_running_data_present == True):
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
gender, category, event)
plt.xlabel('Year')
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times".format(graph_number, gender_label, event_label))
plt.legend()
#if ol_tf_running_group[(ol_tf_running_group[category] == event)].shape[0] != 0:
plt.show()
graph_number+=1
top_running_data_present = True
ol_tf_running_data_present = True
# A label for the graphs plotted
graph_number = 1
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_times(event_categories)
The analysis for the question "How have athletes' performances changed through history?" can be found in section 5.1.
This will be examined by plotting each of the four athlete characteristics (height, weight, age, BMI) against year of competition. Multiple events are plotted on the same axes for ease of comparison. NumPy's polyfit() function is used to fit a straight line of best fit for each event.
The analysis can be found in section 5.2.
def plot_athlete_characteristics(event_categories, characteristics, debug=False):
"""
Helper function to plot athlete characteristics (height, weight, age, BMI) against year of competition.
Multiple events are plotted on the same axis for comparison.
A best fit line is added for each event in the plot.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
meanvals = []
years = []
global graph_number
for c in characteristics:
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Gender'])
events = ol_running[category].unique().tolist()
events.remove(0)
for gender in ol_running['Gender'].unique().tolist():
plt.figure(figsize=(18, 9))
if debug:
print(gender)
for event in events:
if debug:
print(event)
try:
ol_running_group = ol_running_groups.get_group((event, gender))
except KeyError:
if debug:
print("No results for this combination in ol_running_groups")
continue
for year in ol_running_group['Year'].unique().tolist():
meanvals.append(ol_running_group[ol_running_group['Year'] == year][c].mean())
years.append(year)
plt.scatter(years, meanvals, label=str(event)+'m')
# Remove NaN values - these will break the fit used by polyfit() below.
# Pop from the highest index down so earlier removals do not shift later indices.
nullvals = np.isnan(meanvals)
for i in np.where(nullvals)[0][::-1]:
meanvals.pop(i)
years.pop(i)
z = np.polyfit(years, meanvals, 1)
p = np.poly1d(z)
plb.plot(years, p(years))
meanvals = []
years = []
#Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(gender, category, event, c)
plt.xlabel('Year')
plt.ylabel(c+' ({})'.format(unit))
plt.title("Graph {0}: Variation in {1} of {2} {3} Olympic Athletes Through History".format(
graph_number, c, gender_label, category_label))
plt.legend()
plt.show()
graph_number+=1
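The NaN removal and straight-line fit used in plot_athlete_characteristics can be sketched without any plotting. This is a minimal illustration under assumptions: the helper names drop_nan_pairs and fit_line and the sample values are hypothetical, and the closed-form least-squares formula stands in for np.polyfit(years, meanvals, 1) so that the sketch needs only the standard library.

```python
import math


def drop_nan_pairs(xs, ys):
    """Keep only the (x, y) pairs whose y value is not NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not math.isnan(y)]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical yearly means with two missing values
years = [1996, 2000, 2004, 2008, 2012]
mean_heights = [180.0, float("nan"), 181.0, float("nan"), 182.0]

clean_years, clean_heights = drop_nan_pairs(years, mean_heights)
slope, intercept = fit_line(clean_years, clean_heights)
print(clean_years, clean_heights)
print(slope, intercept)
```

Filtering the pairs in a single pass keeps the years and the means aligned and avoids the index shifting that can occur when items are removed from a list one at a time.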
Calculate body mass index (BMI):
ol_running.insert(loc=ol_running.columns.get_loc('Weight'), column='BMI', value=0)
ol_running['BMI'] = ol_running['Weight'] / ((ol_running['Height'] / 100)**2)
ol_running.head()
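As a quick check of the formula, BMI is weight in kilograms divided by the square of height in metres. The sketch below assumes, as the calculation above does, that height is recorded in centimetres and weight in kilograms; the function name bmi and the sample values are hypothetical.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100  # convert centimetres to metres
    return weight_kg / height_m ** 2


# A 70 kg athlete who is 175 cm tall: 70 / 1.75**2
print(round(bmi(70, 175), 1))  # 22.9
```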
characteristics = ['Height', 'Weight', 'Age', 'BMI']
plot_athlete_characteristics(event_categories, characteristics)
The analysis can be found in section 5.2.
This will be examined by plotting each of the four athlete characteristics against their performance (i.e., time) for each event and gender. Colours identify medal-winning performances, as shown in the key.
The analysis can be found in section 5.3.
ol_running_gender_groups = ol_running.groupby('Gender')
ol_running_m = ol_running_gender_groups.get_group('M')
ol_running_f = ol_running_gender_groups.get_group('F')
def plot_time_vs_characteristics(event_categories, characteristics, debug=False):
"""
Helper function to plot athlete characteristics (height, weight, age, BMI) against time.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
global graph_number
data_present = True
for c in characteristics:
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_running_gender_groups.groups.keys():
if gender == 'M':
ol_running_group = ol_running_m
else:
ol_running_group = ol_running_f
plt.figure(figsize=(18, 9))
# Plot each olympic medal colour, if data exists for this event
if ol_running_group[(ol_running_group[category] == event)].shape[0] != 0:
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'G') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'G') &
(ol_running_group['Time'].notna())]['Time']),
color='gold', label='Olympic gold medal')
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'S') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'S') &
(ol_running_group['Time'].notna())]['Time']),
color='silver', label='Olympic silver medal')
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']),
color='brown', label='Olympic bronze medal')
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'].isnull()) &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'].isnull()) &
(ol_running_group['Time'].notna())]['Time']),
color='blue', label='No medal')
else:
data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if data_present == True:
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
gender, category, event, c)
plt.xlabel(c+' ({})'.format(unit))
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times with {3}".format(
graph_number, gender_label, event_label, c))
plt.legend()
plt.show()
graph_number+=1
data_present = True
Are the relationships the same if we split the data into 20 year groups?
# Create a set of year groups throughout Olympic history
year_groups = np.arange(1896, 2020, 20)
year_groups[-1] += 1 # To include 2016 Games
year_groups
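To see which Games fall into which group, the same edges can be rebuilt with the standard library and used to label a single year. This is a sketch under assumptions: the helper year_group_label is hypothetical, and range() stands in for np.arange() with the same start, stop and step.

```python
from bisect import bisect_right

# Same edges as np.arange(1896, 2020, 20), with the last edge bumped by one
edges = list(range(1896, 2020, 20))
edges[-1] += 1  # 2016 -> 2017, so the 2016 Games fall inside the last group


def year_group_label(year):
    """Return the label of the group containing `year`, or None if outside."""
    i = bisect_right(edges, year) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    # Each group covers edges[i] up to, but not including, edges[i + 1]
    return "{0} to {1}".format(edges[i], edges[i + 1] - 1)


print(edges)
print(year_group_label(1912))
print(year_group_label(1916))
print(year_group_label(2016))
```

Because the final edge is bumped from 2016 to 2017, the last group spans 21 years (1996 to 2016) while the others span 20.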
def plot_time_vs_characteristics_time_groups(event_categories, characteristics, debug=False):
"""
Helper function to plot athlete characteristics (height, weight, age, BMI) against time.
This function uses colour codes to show the 20-year time period into which a performance falls.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
global graph_number
data_present = True
for c in characteristics:
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_running_gender_groups.groups.keys():
if gender == 'M':
ol_running_group = ol_running_m
else:
ol_running_group = ol_running_f
plt.figure(figsize=(18, 9))
# Plot
if ol_running_group[(ol_running_group[category] == event)].shape[0] != 0:
for y in range(len(year_groups)-1):
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())]['Time']),
label='{0} to {1}'.format(year_groups[y], year_groups[y+1]-1))
else:
data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if data_present == True:
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
gender, category, event, c)
plt.xlabel(c+' ({})'.format(unit))
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times with {3}".format(
graph_number, gender_label, event_label, c))
plt.legend()
plt.show()
graph_number+=1
data_present = True
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_time_vs_characteristics(event_categories, characteristics)
plot_time_vs_characteristics_time_groups(event_categories, characteristics)
The analysis can be found in section 5.3.
This is the analysis for the results in section 4.1.
This is the analysis for the results in section 4.2.
This is the analysis for the results in section 4.3.
# What do the visualisations and data tell us? Does this answer the question?
# Are any values difficult to predict?
# Are we over/under fitting?
# Can we re-select our data set to optimise the model?
This analysis is published on GitHub (https://github.com/mattjezza/ds-proj1-t2-elite-athletics) and summarised in a post on Medium.